In this project we predict the sentiment of a given text, classifying it as positive, negative, or neutral in nature.

Importing necessary libraries.

Text Normalization

Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can both be transformed to “good”, their canonical form. Another example is mapping near-identical variants such as “stopwords”, “stop-words”, and “stop words” to just “stopwords”.
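The normalization steps above can be sketched in pure Python. The slang map and helper name below are illustrative assumptions, not part of the notebook; in practice a fuller dictionary would be used.

```python
import re

# Tiny hand-made slang map (illustrative, not exhaustive)
SLANG_MAP = {"gud": "good", "u": "you", "gr8": "great"}

def normalize_word(word):
    """Map a word to a canonical form."""
    word = word.lower()
    # Collapse runs of 3+ repeated characters to 2: "gooood" -> "good"
    word = re.sub(r"(.)\1{2,}", r"\1\1", word)
    # Merge near-identical variants: "stop-words" / "stop words" -> "stopwords"
    word = word.replace("-", "").replace(" ", "")
    return SLANG_MAP.get(word, word)

print(normalize_word("gooood"))      # good
print(normalize_word("gud"))         # good
print(normalize_word("stop-words"))  # stopwords
```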

Tokenization

What is tokenization? Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.

Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, and language translation. To achieve these goals it is vital to understand the patterns in the text. Tokens are very useful for finding such patterns, and tokenization is also a base step for stemming and lemmatization.

Sentence tokenization is the process of splitting text into individual sentences. A good sentence tokenizer handles the textual constructs, such as abbreviations, that would otherwise confuse a naive splitter.
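A minimal regex-based sketch of sentence and word tokenization is shown below. It is a simplified stand-in for NLTK's `sent_tokenize`/`word_tokenize` (which additionally handle abbreviations and other edge cases); the function names here mirror NLTK's but the implementations are assumptions.

```python
import re

def sent_tokenize(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace.
    Unlike NLTK's punkt tokenizer, this does not handle abbreviations."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Split a sentence into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Tokenization splits text. It produces tokens!"
sentences = sent_tokenize(text)
print(sentences)                    # ['Tokenization splits text.', 'It produces tokens!']
print(word_tokenize(sentences[0]))  # ['Tokenization', 'splits', 'text']
```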

Alphanumeric characters

Expanding contractions.

It is a process where words like isn't and didn't are expanded to is not and did not: isn't --> is not, I'm --> I am, they're --> they are, shouldn't --> should not, can't --> can not.
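Contraction expansion can be done with a lookup table and one regex substitution. The small map below covers only the examples from the text; a real notebook would use a fuller map or a dedicated package.

```python
import re

# Small contraction map covering the examples above (illustrative)
CONTRACTIONS = {
    "isn't": "is not", "didn't": "did not", "i'm": "i am",
    "they're": "they are", "shouldn't": "should not", "can't": "can not",
}

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text.lower())

print(expand_contractions("I'm sure they're fine"))  # i am sure they are fine
```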

Final Dataset.

Rows 0, 1, 1000, and 1001 are repeated and hence need to be cleaned.

Removing additional characters present in the dataframe.

Lemmatization of the text column

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Lemmatization generates the root form of inflected words.
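A toy sketch of the idea follows. A real pipeline would use NLTK's `WordNetLemmatizer`, which consults the WordNet vocabulary; the tiny exception table and suffix rules below are assumptions made only to illustrate the dictionary-plus-morphology approach.

```python
# Irregular forms need a lookup table (vocabulary); regular forms can be
# handled by morphological rules. Both tables here are toy examples.
IRREGULAR = {"ran": "run", "better": "good", "mice": "mouse", "was": "be"}

def lemmatize(word):
    """Return a crude lemma: exception table first, then suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies"):
        return word[:-3] + "y"   # "studies" -> "study"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # "cars" -> "car"
    return word

print([lemmatize(w) for w in ["studies", "mice", "cars", "glass"]])
# ['study', 'mouse', 'car', 'glass']
```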

Removing stopwords from the dataframe
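Stopword removal is a simple filter against a known set. The set below is a small illustrative sample; in practice NLTK's stopword corpus (`nltk.corpus.stopwords.words("english")`) is typically used.

```python
# Small illustrative stopword set (NLTK's English list has ~180 entries)
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "it", "this"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["this", "movie", "is", "a", "great", "film"]))
# ['movie', 'great', 'film']
```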

Visualization

Visualizing the most frequent words in the dataframe using a word cloud.

Counting the number of times each word occurs throughout the data.

Visualizing the 10 most common words using a bar graph.
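The counting step maps directly onto `collections.Counter`; its `most_common(10)` result is exactly what a bar chart of the top 10 words would plot. The sample tokens are illustrative.

```python
from collections import Counter

# Sample token list standing in for the cleaned corpus
tokens = ["good", "movie", "good", "plot", "bad", "good", "movie"]
counts = Counter(tokens)

print(counts["good"])         # 3
print(counts.most_common(2))  # [('good', 3), ('movie', 2)]
```

With matplotlib, `counts.most_common(10)` can be unzipped into labels and heights for `plt.bar`.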

Sentiment Analysis.

VADER sentiment analysis is performed to determine whether a given word is positive, negative, or neutral in nature.

VADER belongs to a type of sentiment analysis that is based on lexicons of sentiment-related words. In this approach, each word in the lexicon is rated as positive or negative, and in many cases by how positive or negative it is. Below you can see an excerpt from VADER’s lexicon, where more positive words have higher positive ratings and more negative words have lower negative ratings.
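The lexicon-based idea can be sketched in a few lines. The scores below are NOT VADER's real lexicon values — they are made-up illustrations; in practice one would use `SentimentIntensityAnalyzer` from `nltk.sentiment.vader` (or the `vaderSentiment` package).

```python
# Toy lexicon in the spirit of VADER: each word gets a signed valence score
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}

def word_sentiment(word):
    """Classify a single word from its (toy) lexicon score."""
    score = LEXICON.get(word.lower(), 0.0)  # unknown words score 0 (neutral)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(word_sentiment("great"))  # positive
print(word_sentiment("bad"))    # negative
print(word_sentiment("table"))  # neutral
```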

Top 100 Positive words.

Top 100 Negative words.

VADER sentiment analysis to determine whether a given sentence is positive, negative, or neutral in nature.

Converting all polarity scores and sentences into a dataframe.

Arranging the dataset in descending order of compound score to find the most important sentences in the given data.
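The ranking step amounts to a sort on the compound score. The scores below are illustrative placeholders (in the notebook they come from VADER); with pandas, `df.sort_values("compound", ascending=False)` does the same job.

```python
# (sentence, compound) pairs with made-up scores for illustration
scored = [
    ("The plot was dull.", -0.46),
    ("An absolutely wonderful film!", 0.84),
    ("It was fine.", 0.20),
]

# Descending by compound: most positive first, most negative last
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

print(ranked[0][0])   # most positive sentence
print(ranked[-1][0])  # most negative sentence
```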

Finding the most positive sentence in the data.

Finding the most negative sentence in the data.

Setting threshold values to classify whether a given sentence is positive, negative, or neutral in nature.
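The conventional thresholds recommended by VADER's authors are compound >= 0.05 for positive and compound <= -0.05 for negative, with everything in between neutral; a classification helper along those lines:

```python
def label_sentiment(compound):
    """Map a VADER compound score to a sentiment label using the
    conventional +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_sentiment(0.6))   # positive
print(label_sentiment(-0.3))  # negative
print(label_sentiment(0.0))   # neutral
```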

Adding the target (sentiment) column to our dataframe.

Removing/dropping the 'neg', 'neu', 'pos', and 'compound' columns.

Logistic Regression

Decision Tree Classifier

Support Vector Machine